Abstract




The outputs of a trained neural network contain much richer information
than just a one-hot classifier. For example, a neural network might give
an image of a dog a one-in-a-million probability of being a cat, but that
is still much larger than its probability of being a car. To reveal the
hidden structure in these outputs, we apply two unsupervised learning
algorithms, PCA and ICA, to the outputs of a deep Convolutional Neural
Network trained on ImageNet with 1000 classes. The PCA/ICA embedding of
the object classes reveals their visual similarity, and the PCA/ICA
components can be interpreted as common visual features shared by similar
object classes. As an application, we propose a new zero-shot learning
method in which the visual features learned by PCA/ICA are employed. Our
zero-shot learning method achieves state-of-the-art results on ImageNet
with over 20000 classes.


1 Introduction 


Recently, the Convolutional Neural Network (CNN) [LeCun et al., 1998] has
made significant advances in computer vision tasks such as image
classification [Ciresan et al., 2012; Krizhevsky et al., 2012;
Szegedy et al., 2015], object detection [Girshick et al., 2014;
Ren et al., 2015] and image segmentation [Turaga et al., 2010;
Long et al., 2015]. Moreover, CNNs also shed light on neural coding in
the visual cortex. In [Cadieu et al., 2014], it has been shown that a
trained CNN rivals the representational performance of inferior temporal
cortex on a visual object recognition task. Therefore, investigating the
properties of a trained CNN is important both for computer vision
applications and for discovering the principles of neural coding in the
brain.

In [Hinton et al., 2014], it is shown that the softmax outputs of a
trained neural network contain much richer information than just a
one-hot classifier. Such a phenomenon is called dark knowledge. For input
vector $y = (y_1, \dots, y_n)$, which is called logits in
[Hinton et al., 2014], the softmax function produces output vector
$x = (x_1, \dots, x_n)$ such that

$$x_i = \frac{\exp(y_i / T)}{\sum_j \exp(y_j / T)} \tag{1}$$


where T is the temperature parameter. The softmax function assigns
positive probabilities to all classes since $x_i > 0$ for all i. Given a
data point of a certain class as input, even when the probabilities of the
incorrect classes are small, some of them are much larger than the others.
For example, in a 4-class classification task (cow, dog, cat, car), given
an image of a dog, while a hard target (class label) is (0, 1, 0, 0), a
trained neural network might output a soft target
($10^{-6}$, 0.9, 0.1, $10^{-9}$). An image of a dog has a small chance of
being misclassified as a cat, but it is much less likely to be
misclassified as a car. In [Hinton et al., 2014], a technique called
knowledge distillation was introduced to further reveal the information in
the softmax outputs. Knowledge distillation raises the temperature T in
the softmax function to soften the outputs. For example, it transforms
($10^{-6}$, 0.9, 0.1, $10^{-9}$) to (0.015, 0.664, 0.319, 0.001) by
raising the temperature T from 1 to 3. It has been shown that adding the
distilled soft targets to the objective function helps in reducing
generalization error when training a smaller model from an ensemble of
models [Hinton et al., 2014]. Therefore, the outputs of a trained neural
network are far from one-hot hard targets or random noise, and they might
contain rich statistical structures.
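
The effect of the temperature in eq. (1) can be illustrated with a minimal NumPy sketch. The function name `softmax_with_temperature` and the logit values are hypothetical choices made for illustration; they are not taken from the paper.

```python
import numpy as np

def softmax_with_temperature(logits, T=1.0):
    """Softmax of eq. (1): x_i = exp(y_i / T) / sum_j exp(y_j / T)."""
    y = np.asarray(logits, dtype=float) / T
    y = y - y.max()          # subtracting the max leaves the result unchanged
    e = np.exp(y)            # but avoids overflow for large logits
    return e / e.sum()

# Hypothetical logits for the classes (cow, dog, cat, car).
logits = np.array([-4.0, 5.0, 3.0, -8.0])
hard = softmax_with_temperature(logits, T=1.0)   # sharply peaked on "dog"
soft = softmax_with_temperature(logits, T=3.0)   # softened distribution
```

Raising T keeps the ranking of the classes but moves probability mass from the peak to the small entries, which is what makes the relative sizes of the incorrect classes visible.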


In this paper, to explore the information hidden in the outputs, we apply
two unsupervised learning algorithms, Principal Component Analysis (PCA)
and Independent Component Analysis (ICA), to the outputs of a CNN trained
on the ImageNet dataset [Deng et al., 2009] of 1000 object classes. Both
PCA and ICA are special cases of the Factor Analysis model, with different
assumptions on the latent variables. Factor Analysis is a statistical
model which can be used for revealing hidden factors that underlie a
vector of random variables. In the case of a CNN for image classification,
the neurons or computational units in the output layer, viewed as random
variables, represent object classes. A latent factor might represent a
common visual attribute shared by several object classes. It is therefore
desirable to visualize, interpret and make use of the Factor Analysis
models learned on the outputs of a trained CNN.




































2 Softmax 

Because a CNN is trained with one-hot hard targets (class labels), given a
training image as input, the softmax function suppresses the outputs of
most neurons in the output layer and leaves one or a few peak values. For
example, in Figure 1(a), we show the softmax (T = 1) outputs for a
training image. To magnify the tiny values in the softmax outputs, after a
CNN was trained with the softmax function (T = 1), we take the logits y in
eq. (1) and apply the following normalization function

$$x_i = \frac{y_i - \min_k y_k}{\sum_j (y_j - \min_k y_k)} \tag{2}$$

for all i, as the outputs of the CNN, with all the parameters in the CNN
unchanged. This function normalizes y so that x in eq. (2) is still a
probability distribution over classes. We call the x in eq. (2) normalized
logits. In Figure 1(b), we show the outputs of this function given the
same input image as in Figure 1(a).
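
As a small illustration, the normalization described above (shift the logits by their minimum, then divide by the sum so the result is a probability distribution) can be sketched in NumPy as follows. The function name and the example values are hypothetical.

```python
import numpy as np

def normalized_logits(logits):
    """Shift logits so the smallest is 0, then scale them to sum to 1.

    Assumes the logits are not all equal; otherwise the sum is zero.
    """
    y = np.asarray(logits, dtype=float)
    shifted = y - y.min()
    return shifted / shifted.sum()

# Hypothetical logits for four classes.
x = normalized_logits([-4.0, 5.0, 3.0, -8.0])
```

Unlike the softmax, this mapping is affine in the logits, so it does not exponentially suppress the non-peak entries; the class with the smallest logit receives exactly zero.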
Figure 1: Outputs
Figure 2: Kurtosis

In order to apply ICA, the variables must not all be Gaussian. The
non-Gaussianity of a zero-mean random variable x can be measured by the
kurtosis $E\{x^4\} / E\{x^2\}^2 - 3$, which is zero if x is Gaussian. We
computed the kurtosis of the outputs (mean removed) of a CNN with softmax
and with normalized logits, using all the ImageNet ILSVRC2012 training
data. The CNN model and experimental settings are described in Section 4.
The result is that all neurons in the output layer have positive kurtosis,
as shown in Figure 2. Therefore the neurons, as random variables, are
highly non-Gaussian and it is sensible to apply ICA, which is introduced
in the next section.
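
The kurtosis measure above can be computed per output neuron as in the following sketch. The synthetic Gaussian and Laplace samples are stand-ins chosen for illustration; the paper computes the statistic on CNN outputs.

```python
import numpy as np

def excess_kurtosis(samples):
    """Kurtosis E{x^4} / E{x^2}^2 - 3 for each column, after mean removal.

    samples: array of shape (n_samples, n_variables).
    """
    x = np.asarray(samples, dtype=float)
    x = x - x.mean(axis=0)
    m2 = np.mean(x ** 2, axis=0)
    m4 = np.mean(x ** 4, axis=0)
    return m4 / m2 ** 2 - 3.0

rng = np.random.default_rng(0)
data = np.column_stack([
    rng.normal(size=200_000),    # Gaussian: kurtosis close to 0
    rng.laplace(size=200_000),   # super-Gaussian: clearly positive kurtosis
])
kurt = excess_kurtosis(data)
```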

3 Factor Analysis 

In Factor Analysis, we assume the observed variables
$x = (x_1, \dots, x_n)$ are generated by the following model

$$x = As + n \tag{3}$$

where $s = (s_1, \dots, s_n)$ are the latent variables, A is the model
parameter matrix and n are the noise variables. Here, x and s are assumed
to have zero mean. The latent variables s are also assumed to be
uncorrelated and to have unit variance, in other words, to be white.

3.1 Principal Component Analysis

Principal Component Analysis (PCA) is a special case of Factor Analysis.
In PCA, s are assumed to be Gaussian and n are assumed to be zero
(noise-free). Let C denote the covariance matrix of x,
$E = (e_1, \dots, e_n)$ denote the matrix of eigenvectors of C and
$D = \mathrm{diag}(\lambda_1, \dots, \lambda_n)$ denote the diagonal
matrix of eigenvalues of C. The PCA matrix is $E^T$, the whitening matrix
is $U = D^{-1/2} E^T$ and the whitened variables are

z = Ux.
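
A minimal NumPy sketch of PCA whitening by eigendecomposition of the covariance matrix is given below. The function name, the column-per-sample data layout and the synthetic data are assumptions made for illustration.

```python
import numpy as np

def pca_whitening(X, d=None):
    """Return the PCA matrix E^T, the whitening matrix U and z = U x.

    X: array of shape (n_variables, n_samples); columns are samples.
    d: optional number of leading components to keep.
    Assumes the kept eigenvalues of the covariance are strictly positive.
    """
    Xc = X - X.mean(axis=1, keepdims=True)      # remove the mean
    C = Xc @ Xc.T / Xc.shape[1]                 # covariance matrix
    eigvals, eigvecs = np.linalg.eigh(C)        # ascending eigenvalues
    order = np.argsort(eigvals)[::-1]           # sort descending
    if d is not None:
        order = order[:d]
    lam, E = eigvals[order], eigvecs[:, order]
    U = np.diag(lam ** -0.5) @ E.T              # whitening matrix
    return E.T, U, U @ Xc

rng = np.random.default_rng(0)
A = rng.normal(size=(3, 3))                     # hypothetical mixing matrix
X = A @ rng.normal(size=(3, 5000))
E_T, U, Z = pca_whitening(X)
```

The whitened variables have identity covariance by construction, which is the starting point for the ICA step in the next subsection.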


3.2 Independent Component Analysis 

Independent Component Analysis (ICA) [Hyvärinen et al., 2004] is another
special case of Factor Analysis. In ICA, s are assumed to be non-Gaussian
and independent, and n are assumed to be zero. ICA seeks a demixing matrix
W such that the components of Wx are as independent as possible. To obtain
W, we can first decompose it as W = VU, where U is the whitening matrix
and V is an orthogonal matrix, which can be learned by maximizing the
non-Gaussianity or the likelihood function of VUx. The non-Gaussianity can
be measured by kurtosis or negentropy. If dimensionality reduction is
required, we can take the d largest eigenvalues and the corresponding
eigenvectors for the whitening matrix U. As a result, the size of U is
d × n and the size of V is d × d. Scaling each component does not affect
ICA solutions. If W is an ICA demixing matrix, then
$\mathrm{diag}(a_1, \dots, a_d) W$ is also an ICA demixing matrix, where
$\{a_1, \dots, a_d\}$ are non-zero scaling constants of the components.


A classic ICA algorithm is FastICA [Hyvärinen, 1999]. Despite its fast
convergence, FastICA is a batch algorithm which requires all the data to
be loaded for computation in each iteration. Thus, it is unsuitable for
large scale applications. To handle large scale datasets, we use a
stochastic gradient descent (SGD) based ICA algorithm (described in the
Appendix of [Hyvärinen, 1999]). For samples $\{z(1), z(2), \dots\}$, one
updating step of the SGD-based algorithm for a given sample z(t) is:

$$V \leftarrow V + \mu\, g(V z(t))\, z(t)^T + \frac{1}{2} (I - V V^T) V \tag{4}$$

where $\mu$ is the learning rate, $g(\cdot) = -\tanh(\cdot)$ and I is an
identity matrix. In our experiments, V was initialized as a random
orthogonal matrix.

Like FastICA, this SGD-based algorithm requires going through all data
once to compute the whitening matrix U. But unlike FastICA, this SGD-based
algorithm does not require projection or orthogonalization in each step.
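
One step of the update in eq. (4) can be sketched as follows. Averaging the gradient term over a mini-batch of whitened samples is an assumption of this sketch, and the function names and the QR-based random orthogonal initialization are hypothetical choices.

```python
import numpy as np

def random_orthogonal(d, rng):
    """Random d x d orthogonal matrix from the QR decomposition."""
    Q, _ = np.linalg.qr(rng.normal(size=(d, d)))
    return Q

def ica_sgd_step(V, Z, mu):
    """One update of V following eq. (4).

    V:  (d, d) current estimate of the orthogonal matrix.
    Z:  (d, b) mini-batch of whitened samples, one sample per column.
    mu: learning rate.
    """
    d, b = Z.shape
    grad = -np.tanh(V @ Z) @ Z.T / b              # g(Vz) z^T, batch average
    stabilizer = 0.5 * (np.eye(d) - V @ V.T) @ V  # pulls V toward orthogonality
    return V + mu * grad + stabilizer

rng = np.random.default_rng(0)
V0 = random_orthogonal(3, rng)
Z = rng.laplace(size=(3, 500))      # stand-in for whitened CNN outputs
V1 = ica_sgd_step(V0, Z, mu=0.005)
```

When V is exactly orthogonal the last term vanishes, so it only acts as a correction once the gradient term has moved V away from orthogonality; this is why no explicit projection is needed in each step.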

In this algorithm, the assumption on the probability distribution of each
$s_i$ is a super-Gaussian distribution

$$\log p(s_i) = -\log\cosh(s_i) + \text{constant} \tag{5}$$

and therefore

$$g(s_i) = \frac{\partial}{\partial s_i} \log p(s_i) = -\tanh(s_i). \tag{6}$$


Since the variables obtained by linear transformations of Gaussian
variables are also Gaussian, from Section 2 we can infer at least one
neuron in the output layer is non-Gaussian. As an initial attempt, we
choose a particular non-Gaussian distribution here. Explorations of
different non-Gaussian distributions and therefore different
nonlinearities $g(\cdot)$ are left for future research.


4 Results 

4.1 Experimental Settings 


For the trained CNN model, we used GoogLeNet [Szegedy et al., 2015] and
AlexNet [Krizhevsky et al., 2012]. The results of using the two CNN models
are similar. Therefore, due to space limitations, we only report the
results of using GoogLeNet. We used all the images in the ImageNet
ILSVRC2012 training set to compute the ICA matrix using our SGD-based
algorithm with mini-batch size 500. The learning rate was set to 0.005 and
was halved every 10 epochs. The computation of CNN outputs was done with
Caffe [Jia et al., 2014]. The ICA algorithm was run with Theano
[Bergstra et al., 2010].


4.2 Visualization of PCA/ICA components 

To understand what is learned by PCA and ICA, we visualize the PCA and the
ICA matrices. In the PCA matrix $E^T$ or the ICA matrix W, each row
corresponds to a PCA/ICA component and each column corresponds to an
object class. The number of rows depends on the dimensionality reduction.
The number of columns of $E^T$ or W is 1000, corresponding to 1000
classes. After the ICA matrix was learned, each ICA component (a row of W)
was scaled to have unit $\ell_2$ norm. The scaling of each ICA component
does not affect the ICA solution, as discussed in Section 3.2.

Figure 3: Label embedding of object classes by PCA/ICA components. In each
plot, each point is an object class and each axis is a PCA/ICA component
(PC/IC). For visual clarity, only selected points are annotated with
object class labels. Panel (a): PCA, softmax.

In Figure 3 we show the embedding of class labels by PCA and ICA. The
horizontal and the vertical axes are two distinct rows of $E^T$ or W. Each
point in the plot corresponds to an object class, and there are 1000
points in each plot. Dimensionality is reduced from 1000 to 200 in ICA. In
Figure 3 (a) and (b), we plot two pairs of the PCA/ICA components, learned
with softmax outputs. In the PCA embedding, visually similar class labels
lie along some lines, but not the axes, while in the ICA embedding, they
lie along the axes. However, most points are clustered at the origin. In
Figure 3






















Table 1: Object classes ranked by single components of PCA/ICA

       1                 2                   3                    4
PCA    mosque            killer whale        Model T              zebra
       shoji             beaver              strawberry           tiger
       trimaran          valley              hay                  chickadee
       fire screen       otter               electric locomotive  school bus
       aircraft carrier  loggerhead          scoreboard           yellow lady’s slipper

ICA    mosque            killer whale        Model T              zebra
       barn              grey whale          car wheel            tiger
       planetarium       dugong              tractor              triceratops
       dome              leatherback turtle  disk brake           prairie chicken
       palace            sea lion            barn                 warthog

Table 2: Closest object classes in terms of visual and semantic similarity

          Egyptian cat  soccer ball      mushroom          red wine
Visual    tabby cat     rugby ball       bolete            wine bottle
          tiger cat     croquet ball     agaric            beer glass
          tiger         racket           stinkhorn         goblet
          lynx          tennis ball      earthstar         measuring cup
          Siamese cat   football helmet  hen-of-the-woods  wine bottle

Semantic  Persian cat   croquet ball     cucumber          eggnog
          tiger cat     golf ball        artichoke         cup
          Siamese cat   baseball         cardoon           espresso
          tabby cat     ping-pong ball   broccoli          menu
          cougar        punching bag     cauliflower       meat loaf


(c) and (d), we plot two pairs of the PCA/ICA components, learned with
normalized logits outputs. We can see that the class labels are more
scattered in the plots.

In Figure 4 we show the PCA/ICA components of two sets of similar object
classes: (1) Border terrier, Kerry blue terrier, and Irish terrier; (2)
trolleybus, minibus, and sports car. Both PCA and ICA were learned on the
softmax outputs and the dimensionality was reduced to 20 for better
visualization. In Figure 4(a), we see that the PCA components of the
object classes are distributed, while in Figure 4(b), we clearly see some
single ICA components dominating. There are components representing
"dog-ness" and "car-ness". Therefore, the ICA components are more
interpretable.

In Table 1 we show the top-5 object classes according to the value of
PCA/ICA components. For ease of comparison, we selected each PCA/ICA
component which has the largest value for class mosque, killer whale,
Model T or zebra among all components. We can see that the class labels
ranked by ICA components are more visually similar and consistent than the
ones ranked by PCA components.

The PCA/ICA components can be interpreted as common features shared by
visually similar object classes. From Figure 3 and Table 1 we can see that
the label embeddings of object classes by PCA/ICA components are
meaningful, since visually similar classes are close in the embeddings.
Unlike [Akata et al., 2013], these label embeddings can be learned without
supervision from a CNN trained with only one-hot class labels and without
any hand-annotated attribute labels of the object classes, such as "has
tail" or "lives in the sea".

4.3 Visual vs. Semantic Similarity 

The visual-semantic similarity relationship was previously explored in
[Deselaers and Ferrari, 2011], which shows some consistency between the
two similarities. Here we further explore it from another perspective. We
define the visual and the semantic similarity in the following way. The
visual similarity between two object classes is defined as the cosine
similarity of their PCA or ICA components (200-dim and learned with
softmax), both of which give the same results. The semantic similarity is
defined based on the shortest path length¹ between two classes on the
WordNet graph [Fellbaum, 1998].
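
The visual similarity defined above amounts to a cosine similarity between columns of the PCA or ICA matrix. A minimal sketch follows; the function name `closest_classes` and the toy 3 × 4 matrix are hypothetical, standing in for the 200 × 1000 matrix described in the text.

```python
import numpy as np

def closest_classes(W, query, k=5):
    """Indices of the k classes most similar to class `query`.

    W: (d, n_classes) matrix whose columns are the component values of
       each class; similarity is the cosine between columns.
    """
    cols = W / np.linalg.norm(W, axis=0, keepdims=True)  # unit-norm columns
    sims = cols.T @ cols[:, query]                       # cosine similarities
    sims[query] = -np.inf                                # exclude the query
    return np.argsort(-sims)[:k]

# Toy embedding: 3 components, 4 classes.
W = np.array([[1.0, 0.9, 0.0, -1.0],
              [0.0, 0.1, 1.0,  0.0],
              [0.0, 0.0, 0.0,  0.0]])
nearest = closest_classes(W, query=0, k=3)
```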

In Table 2 we compare the five closest classes of Egyptian cat, soccer
ball, mushroom and red wine in terms of visual and semantic similarities.
For Egyptian cat, the visual and semantic similarities give similar
results. For soccer ball, football helmet is close in terms of visual
similarity but distant in terms of semantic similarity. For mushroom and
red wine, the two similarities give very different closest object classes.
The gap between the two similarities is intriguing and therefore worth
further exploration. In the neuroscience literature, it is claimed that
visual cortex representation favors visual rather than semantic similarity
[Baldassi et al., 2013].


¹ Computed with the path_similarity() function in the NLTK tool:
http://www.nltk.org/howto/wordnet.html

Figure 4: Bar plots of PCA/ICA components of object classes.
Dimensionality was reduced to 20 for better visualization.

Figure 5: Our zero-shot learning method. $W^{(1)}$ are the visual features
of the seen classes. $W^{(2)}$ are the semantic features of the seen
classes. $W^{(3)}$ are the semantic features of the unseen classes.
$P^{(1)}$ is the projection matrix from visual space to the common space.
$P^{(2)}$ is the projection matrix from semantic space to the common
space. $f(\cdot)$ is the $\ell_2$ normalization. M are the mean vectors of
the seen classes. x is the CNN output vector.


5 Application: Zero-shot Learning 

To demonstrate the effectiveness of the visual features of object classes
learned by PCA and ICA, we apply them to zero-shot learning. Zero-shot
learning [Larochelle et al., 2008; Lampert et al., 2009;
Palatucci et al., 2009; Rohrbach et al., 2011; Socher et al., 2013] is a
classification task in which some classes have no training data at all. We
call the classes which have training data seen classes and those which
have no training data unseen classes. One can use external knowledge of
the classes, such as attributes, to build the relationship between the
seen and the unseen classes. Then one can extrapolate from the seen
classes to the unseen classes.

Note that the focus of this paper is not zero-shot learning, but the
visual features learned by PCA and ICA on the CNN outputs. Our purpose
here is to give an example of how PCA and ICA features can be used for
computer vision applications. Therefore, we do not intend to provide a
comprehensive comparison or review of different zero-shot learning
methods.


5.1 Previous Work 

Previous state-of-the-art large scale zero-shot learning methods are
DeViSE [Frome et al., 2013] and ConSE [Norouzi et al., 2014]. Both of them
use the ImageNet of 1000 classes for training and the ImageNet of over
20000 classes for testing.

In DeViSE, a CNN is first pre-trained on the ImageNet of 1000 classes.
Then, 500-dimensional semantic features of both seen and unseen classes
are obtained by running word2vec [Mikolov et al., 2013] on Wikipedia.
After that, the last (softmax) layer of the CNN is removed and all the
other parts of the CNN are fine-tuned to predict the semantic features of
the seen classes for each training image. In testing, when a new image
arrives, the prediction is done by computing the cosine similarity of the
CNN output vector and the semantic features of classes. In
[Frome et al., 2013], it has also been shown that DeViSE could give more
semantically reasonable errors for the seen classes.

In ConSE, a CNN is first trained on the ImageNet of 1000 classes and
500-dimensional semantic features of the classes are obtained by running
word2vec on Wikipedia, as in DeViSE. However, ConSE does not require
fine-tuning the CNN to predict the semantic features. The output vector in
ConSE is a convex combination of the semantic features, weighted by the
top activated neurons in the softmax layer. Its testing procedure is the
same as in DeViSE. In [Norouzi et al., 2014], it has been shown that ConSE
gives better performance than DeViSE in the large scale zero-shot learning
experiments.

Our method differs from DeViSE and ConSE by using unsupervised learning
algorithms to learn: (1) visual features of classes; (2) semantic features
of classes from the WordNet graph, instead of Wikipedia; (3) a bridge
between the visual and the semantic features.

5.2 Our Method 

Our method works as follows. In the learning phase, first assume we have
obtained the visual feature vectors
$W^{(1)} = (w^{(1)}_1, \dots, w^{(1)}_n)$ of n seen classes. Let
$M = (m_1, \dots, m_n)$ denote the matrix of the mean outputs of a CNN of
the seen classes, and let $F = f(W^{(1)} M) = (f_1, \dots, f_n)$ be the
transformed mean outputs of the seen classes, where $f(\cdot)$ is a
nonlinear function. Next, assume we have obtained the semantic feature
vectors $W^{(2)} = (w^{(2)}_1, \dots, w^{(2)}_n)$ of n seen classes and
$W^{(3)} = (w^{(3)}_1, \dots, w^{(3)}_m)$ of m unseen classes. Due to the
visual-semantic similarity gap shown in Section 4.3, we learn a bridge
between the visual and the semantic representations of object classes via
Canonical Correlation Analysis (CCA) [Hotelling, 1936;
Hardoon et al., 2004], which seeks two projection matrices $P^{(1)}$ and
$P^{(2)}$ such that


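$$\min_{P^{(1)}, P^{(2)}} \; \left\| P^{(1)T} F - P^{(2)T} W^{(2)} \right\|_F^2 \tag{7}$$

$$\text{s.t.} \quad P^{(k)T} C_{kk} P^{(k)} = I, \tag{8}$$

$$p^{(k)T}_i C_{kl}\, p^{(l)}_j = 0, \quad k, l = 1, 2, \; k \neq l, \; i, j = 1, \dots, d, \; i \neq j, \tag{9}$$

where $p^{(k)}_i$ is the i-th column of $P^{(k)}$ and $C_{kl}$ is a
covariance or cross-covariance matrix of $\{f_1, \dots, f_n\}$ and/or
$\{w^{(2)}_1, \dots, w^{(2)}_n\}$.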
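
One standard way to solve a two-view CCA problem of this form is to whiten each view and take the singular value decomposition of the whitened cross-covariance. The sketch below follows that route; it is not necessarily the solver used in the paper, and the function names, the small ridge term `reg` and the random test data are assumptions made for illustration.

```python
import numpy as np

def inv_sqrt(C):
    """Inverse square root of a symmetric positive-definite matrix."""
    w, Q = np.linalg.eigh(C)
    return Q @ np.diag(w ** -0.5) @ Q.T

def cca(F, G, d, reg=1e-6):
    """Two-view CCA by whitening followed by an SVD.

    F: (p, n) and G: (q, n); column j of F is paired with column j of G.
    Returns projections P1 (p, d), P2 (q, d) and the covariances used.
    The ridge term reg keeps the covariances invertible (an assumption).
    """
    n = F.shape[1]
    Fc = F - F.mean(axis=1, keepdims=True)
    Gc = G - G.mean(axis=1, keepdims=True)
    C11 = Fc @ Fc.T / n + reg * np.eye(F.shape[0])
    C22 = Gc @ Gc.T / n + reg * np.eye(G.shape[0])
    C12 = Fc @ Gc.T / n
    K1, K2 = inv_sqrt(C11), inv_sqrt(C22)
    A, s, Bt = np.linalg.svd(K1 @ C12 @ K2)   # singular values descending
    P1 = K1 @ A[:, :d]
    P2 = K2 @ Bt.T[:, :d]
    return P1, P2, (C11, C22, C12)

rng = np.random.default_rng(0)
F = rng.normal(size=(4, 500))
G = rng.normal(size=(3, 4)) @ F + 0.5 * rng.normal(size=(3, 500))
P1, P2, (C11, C22, C12) = cca(F, G, d=2)
```

With this construction each projected view has identity covariance, and the cross-covariance between the projected views is diagonal, so distinct components of the two views are uncorrelated.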

In the testing phase, when a new image arrives, we first compute its CNN
output x. Then for $P^{(1)T} f(W^{(1)} x)$, we compute its k closest
columns of $P^{(2)T} W^{(2)}$ (seen) and/or $P^{(2)T} W^{(3)}$ (unseen).
The corresponding classes of these k columns are the top-k predictions.
The closeness is measured by cosine similarity.
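
The testing phase can be sketched as a nearest-neighbour search under cosine similarity in the common space. The function names and the toy identity matrices below are hypothetical; `class_embed` stands for a precomputed matrix of projected semantic features, one column per candidate class.

```python
import numpy as np

def l2_normalize(A):
    """Scale each column of A to unit l2 norm."""
    return A / np.linalg.norm(A, axis=0, keepdims=True)

def predict_top_k(x, W1, P1, class_embed, k=5):
    """Top-k class indices for one CNN output vector x.

    x:           (n,) CNN output.
    W1:          (d, n) visual feature matrix.
    P1:          (d, d) projection from visual space to the common space.
    class_embed: (d, c) projected semantic features of the c candidates.
    """
    q = P1.T @ l2_normalize(W1 @ x[:, None])          # image in common space
    sims = (l2_normalize(class_embed).T @ l2_normalize(q)).ravel()
    return np.argsort(-sims)[:k]                      # most similar first

# Toy setup: 3 classes, identity feature and projection matrices.
x = np.array([0.1, 0.7, 0.2])
top = predict_top_k(x, np.eye(3), np.eye(3), np.eye(3), k=3)
```

Because the class embeddings do not depend on the image, they can be computed once and reused for every test image.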

For $W^{(1)}$, we compare random, PCA, and ICA matrices of different
dimensionality in our experiments. The random matrices are
semi-orthogonal, that is, $W^{(1)} W^{(1)T} = I$ but
$W^{(1)T} W^{(1)} \neq I$. For $W^{(2)}$ and $W^{(3)}$, we use the feature
vectors obtained by running classic Multi-dimensional Scaling (MDS) on a
distance matrix of both seen and unseen classes. The distance between two
classes is measured by one minus the similarity in Section 4.3. Each
column of $W^{(2)}$ and $W^{(3)}$ is subtracted by
$\frac{1}{n} \sum_i w^{(2)}_i$. M is approximated by I and $f(\cdot)$ is
the scaling normalization of a vector or each column of a matrix to unit
$\ell_2$ norm. We experimented with softmax with different T and with
normalized logits as the outputs. The best performance (as in Tables 3, 4
and 5) was obtained with the softmax (T = 1) output for x but with $E^T$
and W learnt with normalized logits.

In our method, instead of using word2vec on Wikipedia as in DeViSE and
ConSE, we use classic MDS of the WordNet distance matrix to obtain the
semantic features of classes, for simplicity. Word embedding on Wikipedia
typically consumes a large amount of RAM and takes hours of computation,
while classic MDS on the WordNet distance matrix of size 21842×21842² is
much cheaper to compute. The computation of a 21632-dimensional MDS
feature vector for each class was done in MATLAB with 8 Intel Xeon 2.5GHz
cores within 12 minutes. A comprehensive comparison of different semantic
features of classes for zero-shot learning can be found in
[Akata et al., 2015].
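
Classic MDS can be sketched in a few lines of NumPy: double-center the squared distance matrix and take the eigenvectors with positive eigenvalues. Keeping only eigenvalues above a small relative tolerance is an assumption of this sketch, and the five 2-D points are toy data.

```python
import numpy as np

def classical_mds(D, tol=1e-9):
    """Classic MDS of a symmetric distance matrix D of shape (n, n).

    Returns an (r, n) matrix with one feature vector per column, where r
    is the number of eigenvalues above tol times the largest eigenvalue.
    """
    n = D.shape[0]
    J = np.eye(n) - np.ones((n, n)) / n          # centering matrix
    B = -0.5 * J @ (D ** 2) @ J                  # double-centered Gram matrix
    eigvals, eigvecs = np.linalg.eigh(B)
    order = np.argsort(eigvals)[::-1]            # descending eigenvalues
    order = order[eigvals[order] > tol * eigvals.max()]
    return (eigvecs[:, order] * np.sqrt(eigvals[order])).T

# Toy check: distances between five points in the plane.
pts = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 2.0], [3.0, 1.0], [2.0, 2.0]])
D = np.linalg.norm(pts[:, None] - pts[None], axis=2)
Y = classical_mds(D)
D_rec = np.linalg.norm(Y.T[:, None] - Y.T[None], axis=2)
```

For Euclidean input distances the embedding reproduces the distances exactly; for a graph-based distance such as one minus the WordNet path similarity, some eigenvalues can be negative and are discarded here.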


5.3 Experiments 

Following the zero-shot learning experimental settings of DeViSE and
ConSE, we used a CNN trained on ImageNet ILSVRC2012 (1000 seen classes),
and tested our method on classifying images in ImageNet 2011fall (20842
unseen classes³, 21841 both seen and unseen classes). We use the top-k
accuracy measure (also called flat hit@k in [Frome et al., 2013;
Norouzi et al., 2014]), the percentage of test images for which a method’s
top-k predictions contain the true label.
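
The top-k accuracy measure can be computed as in the following sketch. The function name and the toy score matrix are hypothetical.

```python
import numpy as np

def top_k_accuracy(scores, labels, k):
    """Percentage of rows whose k highest-scoring classes include the label.

    scores: (n_images, n_classes) similarity or probability scores.
    labels: (n_images,) integer indices of the true classes.
    """
    top_k = np.argsort(-scores, axis=1)[:, :k]       # best k classes per image
    hits = (top_k == labels[:, None]).any(axis=1)
    return 100.0 * hits.mean()

scores = np.array([[0.1, 0.7, 0.2],
                   [0.5, 0.3, 0.2],
                   [0.2, 0.3, 0.5]])
labels = np.array([1, 1, 0])
acc1 = top_k_accuracy(scores, labels, k=1)
acc2 = top_k_accuracy(scores, labels, k=2)
acc3 = top_k_accuracy(scores, labels, k=3)
```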

For the trained CNN model, we experimented with GoogLeNet and AlexNet.
Although GoogLeNet outperforms AlexNet on the seen classes, our method
with the two different CNN models performs essentially the same on the
zero-shot learning tasks. Due to space limitations, we only report the
results of using GoogLeNet.

The sizes of the matrices in our method are: $W^{(1)}$ is k × 1000,
$W^{(2)}$ is 21632 × 1000, $W^{(3)}$ is 21632 × 20842, $P^{(1)T}$ is
k × k, $P^{(2)T}$ is k × 21632, M is 1000 × 1000 and x is 1000 × 1. We
used k = 100, 500, 900 in our experiments. Although $W^{(2)}$ and
$W^{(3)}$ are large matrices, we only need to compute once and store
$P^{(2)T} W^{(2)}$ and $P^{(2)T} W^{(3)}$, which are of sizes k × 1000 and
k × 20842, respectively.

In Table 3 we show the results of the three zero-shot learning methods on
the test images selected in [Norouzi et al., 2014]. Like ConSE, our method
gives correct or reasonable predictions.

In Table 4 we show the results of different methods on ImageNet 2011fall.
Our method performs better when using PCA or ICA for the visual features
than when using random features. And our method with random, PCA, or ICA
features achieves the state-of-the-art records on this zero-shot learning
task.

In Table 5 we show the results of different methods on the ImageNet
ILSVRC2012 validation set of 1000 seen classes. While the goal here is not
to classify images of seen classes, it is desirable to measure how much
accuracy a zero-shot learning method would lose compared to the softmax
baseline. Again, we can see that our method performs better using PCA or
ICA for the visual features than using random features.

The results show that in our method the PCA or ICA matrix as visual
features of object classes performs better than a random matrix.
Therefore, these visual features, learned by PCA and ICA on the outputs of
a CNN, are indeed effective for the subsequent tasks. The results also
show that PCA and ICA give essentially the same classification accuracy.
Therefore, in practice we can use PCA instead of ICA, which has much
higher computational costs. For a more comprehensive discussion of PCA
vs. ICA for recognition tasks, see [Asuncion Vicente et al., 2007]. The
code for reproducing the experiments is at

https://github.com/yaolubrain/ULNNO


² 21841 classes in ImageNet 2011fall plus class teddy, teddy bear. Class
teddy, teddy bear (WordNet ID: n04399382) is in ImageNet ILSVRC2012 but
not in ImageNet 2011fall.

³ Since class teddy, teddy bear is missing in ImageNet 2011fall, the
correct number of classes is 21841 - (1000 - 1) = 20842 rather than 20841.


















Table 3: Predictions of test images of unseen classes (correct class labels are in blue)

Methods compared: DeViSE [Frome et al., 2013], ConSE [Norouzi et al., 2014] and our method. For each test image, the five predictions of each method are listed in rank order, separated by semicolons.

Test image 1
  DeViSE:      water spaniel; tea gown; bridal gown, wedding gown; spaniel; tights, leotards
  ConSE:       business suit; dress, frock; hairpiece, false hair, postiche; swimsuit, swimwear, bathing suit; kit, outfit
  Our method:  periwig, peruke; horsehair wig; hound, hound dog; bonnet macaque; toupee, toupe

Test image 2
  DeViSE:      heron; owl, bird of Minerva, bird of night; hawk; bird of prey, raptor, raptorial bird; finch
  ConSE:       ratite, ratite bird, flightless bird; peafowl, bird of Juno; common spoonbill; New World vulture, cathartid; Greek partridge, rock partridge
  Our method:  ratite, ratite bird, flightless bird; kiwi, apteryx; moa; elephant bird, aepyornis; emu, Dromaius novaehollandiae

Test image 3
  DeViSE:      elephant; turtle; turtleneck, turtle, polo-neck; flip-flop, thong; handcart, pushcart, cart, go-cart
  ConSE:       California sea lion; Steller sea lion; Australian sea lion; South American sea lion; eared seal
  Our method:  fur seal (a); eared seal; fur seal (b); guadalupe fur seal; Alaska fur seal

Test image 4
  DeViSE:      golden hamster, Syrian hamster; rhesus, rhesus monkey; pipe; shaker; American mink, Mustela vison
  ConSE:       golden hamster, Syrian hamster; rodent, gnawer; Eurasian hamster; rhesus, rhesus monkey; rabbit, coney, cony
  Our method:  golden hamster, Syrian hamster; Eurasian hamster; prairie dog, prairie marmot; skink, scincid, scincid lizard; mountain skink

Test image 5
  DeViSE:      truck, motortruck; skidder; tank car, tank; automatic rifle, machine rifle; trailer, house trailer
  ConSE:       flatcar, flatbed, flat; truck, motortruck; tracked vehicle; bulldozer, dozer; wheeled vehicle
  Our method:  farm machine; cultivator, tiller; skidder; bulldozer, dozer; haymaker, hay conditioner

Test image 6
  DeViSE:      kernel; littoral, litoral, littoral zone, sands; carillon; Cabernet, Cabernet Sauvignon; poodle, poodle dog
  ConSE:       dog, domestic dog; domestic cat, house cat; schnauzer; Belgian sheepdog; domestic llama, Lama peruana
  Our method:  mastiff; alpaca, Lama pacos; domestic llama, Lama peruana; guanaco, Lama guanicoe; Seeing Eye dog

Table 4: Top-k accuracy in ImageNet 2011fall zero-shot learning task (%)

Test Set  #Classes  #Images       Method                            Top-1   Top-2   Top-5   Top-10  Top-20
Unseen    20842     12.9 million  DeViSE (500-dim)                  0.8     1.4     2.5     3.9     6.0
                                  ConSE (500-dim)                   1.4     2.2     3.9     5.8     8.3
                                  Our method (100-dim, random)      1.4     2.2     3.4     4.3     5.2
                                  Our method (100-dim, PCA)         1.6     2.7     4.6     6.4     8.6
                                  Our method (100-dim, ICA)         1.6     2.7     4.6     6.3     8.5
                                  Our method (500-dim, random)      1.8     2.9     5.0     6.9     8.8
                                  Our method (500-dim, PCA)         1.8     3.0     5.2     7.3     9.6
                                  Our method (500-dim, ICA)         1.8     3.0     5.2     7.3     9.7
                                  Our method (900-dim, random)      1.8     3.0     5.1     7.2     9.6
                                  Our method (900-dim, PCA)         1.8     3.0     5.2     7.3     9.7
                                  Our method (900-dim, ICA)         1.8     3.0     5.2     7.3     9.7

Both      21841     14.2 million  DeViSE (500-dim)                  0.3     0.8     1.9     3.2     5.3
                                  ConSE (500-dim)                   0.2     1.2     3.0     5.0     7.5
                                  Our method (100-dim, random)      6.7     8.2     10.0    11.1    12.1
                                  Our method (100-dim, PCA)         6.7     8.1     10.3    12.4    14.8
                                  Our method (100-dim, ICA)         6.7     8.1     10.4    12.4    14.7
                                  Our method (500-dim, random)      6.7     8.5     11.2    13.4    15.6
                                  Our method (500-dim, PCA)         6.7     8.5     11.4    13.7    16.3
                                  Our method (500-dim, ICA)         6.7     8.5     11.4    13.7    16.3
                                  Our method (900-dim, random)      6.7     8.5     11.4    13.7    16.2
                                  Our method (900-dim, PCA)         6.7     8.5     11.4    13.7    16.3
                                  Our method (900-dim, ICA)         6.7     8.5     11.4    13.7    16.3


(a) WordNet ID: n02077152. There are two classes named fur seal with different WordNet IDs.
(b) WordNet ID: n02077658.





























Table 5: Top-k accuracy in ImageNet ILSVRC2012 validation set (%)

Test Set  #Classes  #Images       Method                            Top-1   Top-2   Top-5   Top-10  Top-20
Seen      1000      50000         Softmax baseline (1000-dim)       55.6    67.4    78.5    85.0    -
                                  DeViSE (500-dim)                  53.2    65.2    76.7    83.3    -
                                  ConSE (500-dim)                   54.3    61.9    68.0    71.6    -
                                  Our softmax baseline (1000-dim)   67.1    78.8    87.9    92.2    95.2
                                  Our method (100-dim, random)      67.0    74.6    77.8    79.1    80.4
                                  Our method (100-dim, PCA)         67.0    76.9    84.6    88.5    91.5
                                  Our method (100-dim, ICA)         67.0    76.9    84.6    88.5    91.5
                                  Our method (500-dim, random)      67.1    77.3    83.5    85.4    86.6
                                  Our method (500-dim, PCA)         67.1    78.2    86.2    89.4    91.2
                                  Our method (500-dim, ICA)         67.1    78.2    86.2    89.3    91.2
                                  Our method (900-dim, random)      67.1    78.3    86.0    88.6    90.1
                                  Our method (900-dim, PCA)         67.1    78.5    86.6    89.8    91.7
                                  Our method (900-dim, ICA)         67.1    78.4    86.5    89.8    91.7


6 Discussion and Conclusion

The outputs of a neural network contain rich information. It has been
claimed that one can determine a neural network architecture by observing
its outputs given arbitrary inputs [Fefferman and Markel, 1994]. Also, it
has been shown that one can reconstruct the whole image to some degree
with only its CNN outputs [Dosovitskiy and Brox, 2015]. And smooth
regularization on the output distribution of a neural network can help in
reducing generalization error in both supervised and semi-supervised
settings [Miyato et al., 2013].

CNNs achieve state-of-the-art results on many computer vision tasks such
as image classification and object detection. However, despite many
efforts at visualizing and understanding CNNs [Zeiler and Fergus, 2014;
Simonyan et al., 2014; Zhou et al., 2015], the CNN still remains a
black-box method. In this paper, we attempted to understand the CNN by
unsupervised learning. The CNN was trained with only one-hot targets,
which means we assumed object classes are equally similar. We never told
the CNN which classes are more similar. But unsupervised learning on CNN
outputs reveals the visual similarity of object classes. We hope this
finding can shed some light on the object representation in CNNs.

We also showed that there is a gap between the visual similarity of object
classes in a CNN and the semantic similarity of object classes in our
knowledge graph. Therefore, a bridge should be built in order to achieve a
consistent mapping between visual and semantic representations.

Supervised learning alone cannot deal with unseen classes since there is
no training data. By using external knowledge and unsupervised learning
algorithms, we can leverage supervised learning so as to make reasonable
predictions on the unseen classes while maintaining compatibility with the
seen classes, that is, zero-shot learning. In this paper, we proposed a
new zero-shot learning method, which achieves state-of-the-art results on
ImageNet with over 20000 classes.